refactor(supervise): substrate-agnostic TraceSource — sandbox-first trace analysis#320

Closed
drewstone wants to merge 10 commits into main from feat/trace-source-sandbox

@drewstone


Corrects the trace-analysis layer I built router-only. Production is sandbox/fleet — the detectors must run there, and the SDK exposes the tool calls via the session (SessionMessage.parts / streamPrompt), not exportTrace (which is sandbox telemetry — my earlier red herring).

The fix

One interface over agent-eval's ToolSpan (the common currency), two source implementations:

  • createPushTraceSource — owned loops (router-tools, cli-bridge tool dispatch): the loop records each tool call.
  • sandboxSessionTraceSource(box, sessionId) — the box: box.messages({sessionId}) → session parts → decodeToolPart (defensive across OpenAI function + harness tool/tool_use shapes) → spans.
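The owned-loop half of this design can be sketched roughly as follows. This is an illustration only, not the PR's code: the `ToolSpan` fields, the `TraceSource` method names (`spans`, `subscribe`), and the body of `createPushTraceSource` are assumptions, and the real `ToolSpan` type comes from agent-eval and may differ.

```typescript
// Hypothetical, simplified stand-in for agent-eval's ToolSpan (field names assumed).
interface ToolSpan {
  tool: string;
  args: unknown;
  status: "ok" | "error";
}

// Assumed interface shape: a snapshot for settle-time analysis
// plus a subscription for online detection.
interface TraceSource {
  spans(): readonly ToolSpan[];
  subscribe(listener: (span: ToolSpan) => void): () => void;
}

// Push source for owned loops: the loop calls record() after each tool call.
function createPushTraceSource(): TraceSource & { record(span: ToolSpan): void } {
  const recorded: ToolSpan[] = [];
  let listeners: Array<(span: ToolSpan) => void> = [];
  return {
    record(span) {
      recorded.push(span);
      listeners.forEach((listener) => listener(span));
    },
    spans: () => recorded.slice(),
    subscribe(listener) {
      listeners.push(listener);
      // The returned function unsubscribes this listener.
      return () => {
        listeners = listeners.filter((l) => l !== listener);
      };
    },
  };
}
```

A sandbox-backed source would implement the same `TraceSource` shape, filling `spans()` from decoded session parts instead of `record()` calls, so consumers never need to know which substrate produced the spans.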

Two consumers ride the source: watchTrace (online → finding on the bus) and analyzeTrace (settle → agent-eval batch analyzers buildTrajectory/stuckLoopView/toolWasteView).

Deleted the premature router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.

Why

This is the §1.5 'author the interface, materialize per substrate' rule I violated by building a router-only onToolStep seam. ToolSpan is the shared currency; a new substrate implements one interface; the same published agent-eval detectors + analyzers run everywhere. Local testing via cli-bridge/router; staging/prod via sandboxes.

Verification

  • decoder across OpenAI + harness shapes; the sandbox box path end-to-end via a mock box (session parts → loop detected by the batch analyzer); owned-loop push path; online + settle consumers.
  • full suite 1023 pass; typecheck/build/lint clean; merges cleanly into main.
  • Honest gap: the exact live harness part-shape has not yet been validated on a real box or cli-bridge run. The decoder is defensive, but the precise parts schema still has to be confirmed against the running SDK, not just the .d.ts.

🤖 Generated with Claude Code

drewstone added 10 commits June 16, 2026 14:36
… resume_worker

Close the bus to 100% bidirectional. The parent→child down-leg routes to the child
inbox (scope.send→deliver) AND records a queue:false event on the same bus: it lands
in history() + reaches subscribers for the audit trail, but is never pulled back by
the parent. New: resume_worker (continue a parked worker — the protocol had {resume}
but no verb); answer_question now routes the answer DOWN to the asking worker, unparking
it. EventBus gains PublishOptions.queue for record-only events.
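A record-only publish could look roughly like the sketch below. It assumes a bus made of a pull queue, a history log, and a subscriber list; apart from `PublishOptions.queue` and `history()`, which the commit message names, every identifier here is hypothetical.

```typescript
interface BusEvent { kind: string; payload?: unknown }
interface PublishOptions { queue?: boolean } // queue: false means record-only

function createEventBus() {
  const log: BusEvent[] = [];
  const pending: BusEvent[] = [];
  const subscribers: Array<(event: BusEvent) => void> = [];
  return {
    publish(event: BusEvent, options: PublishOptions = {}): void {
      log.push(event);                                   // always lands in the audit trail
      subscribers.forEach((fn) => fn(event));            // always reaches subscribers
      if (options.queue !== false) pending.push(event);  // only queued events can be pulled
    },
    pull(): BusEvent | undefined {
      return pending.shift();
    },
    history(): readonly BusEvent[] {
      return log.slice();
    },
    subscribe(fn: (event: BusEvent) => void): void {
      subscribers.push(fn);
    },
  };
}
```

With this shape, a down-leg event published with `{ queue: false }` shows up in `history()` and in subscriber callbacks, but the parent's `pull()` never returns it.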

down-leg + bidirectional history tests; full suite 1000 pass; typecheck/build/lint clean.
…iew gaps

Address PR #318 review:
- BLOCKING: answer_question computed `delivered` but returned only { question } —
  now returns { question, delivered }, consistent with steer_worker/resume_worker
  (no longer hides whether the answer reached a live worker).
- tests: answer routed down to a LIVE worker (delivered:true happy path); resume_worker
  delivered:false path; a focused event-bus queue:false unit test (history+subscribers
  see it, pull queue never does).
- resume_worker added to OPERATOR_TOOLS + the driver system prompt so the driver is
  actually prompted to use it.

Make the down-leg actually move a live worker (was observable-only). New createInbox
(supervise/inbox.ts) is the receive end an executor exposes as Executor.deliver; the
owned tool-loop (routerToolsInlineExecutor) drains it two ways:
- QUEUED (default): flush at each step boundary AND before the worker may settle — it
  can't finish while a steer/answer it never read is pending.
- FORCEFUL (steer_worker interrupt:true): aborts the in-flight turn so the worker
  re-plans immediately, breaking it off a wrong path mid-task.
Black-box CLI harnesses can't be interrupted mid-step → down-leg degrades to next spawn.
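The two drain modes can be illustrated with a toy, synchronous loop. This is a sketch under assumptions: the `DownMessage` shape, the `onInterrupt` callback, and `runLoop` are hypothetical stand-ins, and the real executor is asynchronous and aborts an in-flight inference call rather than invoking a plain callback.

```typescript
type DownMessage = { kind: "steer" | "answer"; text: string; interrupt?: boolean };

// Receive end of the down-leg. onInterrupt stands in for aborting the in-flight turn.
function createInbox(onInterrupt: () => void) {
  const queue: DownMessage[] = [];
  return {
    deliver(message: DownMessage): void {
      queue.push(message);
      if (message.interrupt) onInterrupt(); // FORCEFUL: break the worker off its current turn
    },
    drain(): DownMessage[] {
      return queue.splice(0, queue.length);
    },
    hasPending(): boolean {
      return queue.length > 0;
    },
  };
}
type Inbox = ReturnType<typeof createInbox>;

// Toy worker loop: returns the texts of every down-leg message it read.
function runLoop(inbox: Inbox, plannedTurns: number, onTurn: (turn: number) => void): string[] {
  const read: string[] = [];
  let turn = 0;
  // QUEUED: keep going while turns remain OR unread messages are pending,
  // so the worker cannot settle with a steer it never saw.
  while (turn < plannedTurns || inbox.hasPending()) {
    inbox.drain().forEach((m) => read.push(m.text)); // flush at the step boundary
    onTurn(turn);
    turn += 1;
  }
  return read;
}
```

In this sketch a message delivered during the final planned turn forces one extra turn, which is the flush-before-settle behaviour the commit describes.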

inbox 4 + executor-drains-inbox integration test (flush-before-settle proven end to end
through the real executor); full suite 1008 pass; typecheck/build/lint clean.
…sendDown covers answer

PR #318 audit follow-ups (non-blocking):
- resume_worker description no longer implies a park/resume lifecycle the scope model
  lacks — a settled (drained) worker is gone; says so and points to spawning fresh.
- sendDown now covers the 'answer' down-leg too (removes the inline bus.publish
  duplication; one helper for all three down kinds).
- history() docstring lists the down-leg event kinds.

full suite 1008 pass; typecheck/lint clean.

Simplify without losing capability:
- MERGE steer_worker + resume_worker → one steer_worker (any live worker; the only
  real axis was interrupt forceful-vs-queued, already a param). 'Resume' = a non-
  interrupt steer. Removes a redundant verb + dissolves the resume-vs-steer prompt nits.
- REMOVE await_next — it was a strict subset of await_event({kinds:['settled']}).
  One wait-verb now; callers/prompts pass kinds:['settled'] for the next finished worker.
- DROP bus.peek() — speculative, only its own test used it (YAGNI).

Down-leg event union + inbox shed the dead 'resume' kind. Full suite 1007 pass;
typecheck/build/lint clean.
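Why `await_next` was redundant is easy to see in a small sketch. The version below is a non-blocking simplification with hypothetical names (`takeEvent`, `takeNextSettled`, the `BusEvent` shape); the real `await_event` waits for a matching event rather than returning immediately.

```typescript
type BusEvent = { kind: "settled" | "finding" | "question"; worker: string };

// Take the first queued event whose kind matches the filter; leave the rest queued.
function takeEvent(
  queue: BusEvent[],
  filter: { kinds?: Array<BusEvent["kind"]> } = {},
): BusEvent | undefined {
  const kinds = filter.kinds;
  for (let i = 0; i < queue.length; i++) {
    const event = queue[i];
    if (event && (!kinds || kinds.indexOf(event.kind) !== -1)) {
      queue.splice(i, 1);
      return event;
    }
  }
  return undefined; // nothing matched: stay idle
}

// What the removed verb amounted to: the general verb with a fixed filter.
const takeNextSettled = (queue: BusEvent[]) => takeEvent(queue, { kinds: ["settled"] });
```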
…gent-eval kernel)

createDetectorMonitor (supervise/detector-monitor.ts) — the online analyst on the live
worker pipe. Folds each tool step through agent-eval 0.93.0's published streaming kernel
(repeatedActionDetector/errorStreakDetector — the SAME kernel control-runtime folds; no
detection logic reimplemented) and fires onSignal → a finding on the bus the moment a
worker loops or error-storms. routerToolsInlineExecutor feeds it via a new onToolStep seam.
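To make the online fold concrete, here is a minimal streaming detector written from scratch. It is not agent-eval's `repeatedActionDetector` or `errorStreakDetector`, whose signatures are not shown in this PR; the `ToolStep` shape, the threshold semantics, and `createStreamingDetector` are all assumptions for illustration.

```typescript
interface ToolStep { tool: string; args: unknown; status: "ok" | "error" }
type Signal = { kind: "repeated-action" | "error-streak"; count: number };

// Fold one step at a time and fire onSignal the moment a threshold is reached.
function createStreamingDetector(threshold: number, onSignal: (signal: Signal) => void) {
  let lastKey = "";
  let repeats = 0;
  let errors = 0;
  return function observe(step: ToolStep): void {
    const key = step.tool + ":" + JSON.stringify(step.args);
    repeats = key === lastKey ? repeats + 1 : 1;        // consecutive identical calls
    lastKey = key;
    errors = step.status === "error" ? errors + 1 : 0;  // consecutive failures
    // Fire exactly once per streak, when the count first hits the threshold.
    if (repeats === threshold) onSignal({ kind: "repeated-action", count: repeats });
    if (errors === threshold) onSignal({ kind: "error-streak", count: errors });
  };
}
```

In the PR's design the `onSignal` callback is where a finding would be published on the bus; here it is just a function the caller supplies.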

Bumps agent-eval ^0.93.0. monitor tests (4); full suite 1011 pass; typecheck/build/lint clean.

Last mile: createCoordinationTools.raiseFinding (exposed on the MCP handle) — the seam
an ONLINE detector uses to publish a finding on the live bus mid-run. Proven end-to-end:
a stuck-loop on the worker pipe → monitor → raiseFinding → await_event surfaces it.

Review fixes (audit on the earlier commit):
- HIGH: AbortSignal.any (needs Node 20.3, floor is 20) → portable mergeAbortSignals.
- forceful interrupt: docstring no longer overpromises (aborts in-flight inference, a
  tool mid-exec finishes first); interrupted turns no longer count toward maxTurns;
  added the e2e test (forceful steer aborts the turn, re-plans, aborted turn is free).
- answer to a BLOCKING question is now delivered forcefully (interrupt) to unpark the
  worker immediately, not at its next boundary.
- sendDown 'answer' now REQUIRES questionId (overload; no silent ?? '' mask).
- tool-step status captured (error vs ok) for the error-streak detector.
- stale await_next purged from bench prompts + docs; history() docstring drops 'resume'.
- added tests: answer delivered:false + return asserted; await_event idle-on-mismatch.

full suite 1014 pass; typecheck/build/lint clean.
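A portable replacement for `AbortSignal.any` can be built from `AbortController` and plain event listeners. The sketch below is one way to do it, not necessarily how the PR's `mergeAbortSignals` is written; the return shape with a `dispose` function is an assumption, and the abort reason is deliberately not forwarded to keep the example small.

```typescript
// Merge several AbortSignals into one, without relying on AbortSignal.any.
function mergeAbortSignals(signals: AbortSignal[]): { signal: AbortSignal; dispose: () => void } {
  const controller = new AbortController();

  // Detach from every source so long-lived signals do not accumulate listeners.
  function dispose(): void {
    signals.forEach((s) => s.removeEventListener("abort", onAbort));
  }

  function onAbort(): void {
    dispose();
    controller.abort();
  }

  for (let i = 0; i < signals.length; i++) {
    const source = signals[i];
    if (!source) continue;
    if (source.aborted) {
      onAbort(); // a source was already aborted: the merged signal starts aborted
      break;
    }
    source.addEventListener("abort", onAbort);
  }

  return { signal: controller.signal, dispose };
}
```

Calling `dispose()` once a turn finishes is what keeps listeners from piling up on a signal that outlives many turns.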
…es agent-eval)

createTrajectoryRecorder (supervise/trajectory-recorder.ts) — the post-hoc half of the
analyst pipe. Replays a worker's captured tool steps as agent-eval spans (InMemoryTraceStore)
and runs its PUBLISHED batch analyzers — buildTrajectory (structured run summary),
stuckLoopView (full-run repeated-call view, complementing the online consecutive detector),
toolWasteView. No analysis reimplemented; the thin bridge from live tool steps to the
substrate trace model. Feeds from the same onToolStep seam as the online monitor.

3 recorder tests (real spans → real agent-eval findings); full suite 1017 pass;
typecheck/build/lint clean. Closes both legs: online (mid-run) + settle (post-hoc).
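The difference between the online consecutive detector and a full-run repeated-call view can be shown with a small stand-alone function. This is not agent-eval's `stuckLoopView`; `repeatedCallView`, the `ToolSpan` fields, and the `minCount` parameter are hypothetical.

```typescript
interface ToolSpan { tool: string; args: unknown }

// Count every (tool, args) pair across the whole run, consecutive or not.
function repeatedCallView(
  spans: ToolSpan[],
  minCount: number,
): Array<{ key: string; count: number }> {
  const counts: Record<string, number> = {};
  spans.forEach((span) => {
    const key = span.tool + ":" + JSON.stringify(span.args);
    counts[key] = (counts[key] || 0) + 1;
  });
  return Object.keys(counts)
    .filter((key) => (counts[key] || 0) >= minCount)
    .map((key) => ({ key, count: counts[key] || 0 }));
}
```

A worker that alternates between two calls (A, B, A, B, A) never produces three consecutive identical steps, so a consecutive detector stays silent, while this batch view still reports that A ran three times.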
…, comment accuracy)

- mergeAbortSignals listener leak: pre-link external signals ONCE; per-turn add+remove the
  listener (no accumulation on long-lived signals over maxTurns).
- interrupt catch now requires a real AbortError (DOMException) — a network fault coincident
  with an interrupt is no longer swallowed; rethrown.
- corrected the comment: an interrupted+re-planned turn DOES consume a maxTurns slot (bounded
  backstop, not a hang) — it just doesn't bill a turn.
- onToolStep is an observability side-channel: wrapped so a throwing monitor can't crash the
  worker loop; detector-monitor.observeToolStep also defends argHash on circular/unhashable args.
- projectEvent preserves questionId on the answer branch.
- stale await_next purged from skills/{supervise,loop-writer}; trimmed CLAUDE.md redundancy;
  softened the recorder's per-span-duration claim.

full suite 1018 pass; typecheck/build/lint clean.
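Three of these hardening fixes follow common defensive patterns, sketched below. The helper names (`isAbortError`, `safeObserver`, `argHash`) and the `"<unhashable>"` fallback are assumptions; in particular, this `isAbortError` checks only the error's `name`, whereas the commit describes requiring a `DOMException`.

```typescript
// Treat only a genuine abort as an interrupt; anything else should be rethrown.
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}

// Wrap an observability callback so a throwing monitor cannot crash the worker loop.
function safeObserver<T>(observer: (value: T) => void): (value: T) => void {
  return (value) => {
    try {
      observer(value);
    } catch {
      // Side-channel only: swallow the error and let the loop continue.
    }
  };
}

// Hash args defensively: circular or otherwise unserializable values fall back to a constant.
function argHash(args: unknown): string {
  try {
    const json = JSON.stringify(args);
    return json === undefined ? "<unhashable>" : json;
  } catch {
    return "<unhashable>";
  }
}
```

In a catch block this would be used as `if (!isAbortError(err)) throw err;`, so a network fault that happens to coincide with an interrupt still surfaces.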
… replace router-only seam

The detector/analyzer were built router-only (onToolStep/ToolStep) — premature; production is
sandbox/fleet. Corrected to one interface over agent-eval's ToolSpan:

- TraceSource (trace-source.ts): a worker's tool calls as ToolSpans, from an OWNED loop
  (createPushTraceSource — router/cli-bridge dispatch) OR a SANDBOX box
  (sandboxSessionTraceSource(box, sessionId) → box.messages() session parts → decodeToolPart,
  defensive across OpenAI + harness shapes). The SDK exposes tool calls via the session
  (SessionMessage.parts / streamPrompt), NOT exportTrace (sandbox telemetry) — corrected.
- watchTrace (online) + analyzeTrace (settle) now consume a TraceSource, not a router seam.
- DELETED the router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.

Common currency = ToolSpan; same agent-eval detectors + batch analyzers over any substrate.
trace-source 11 + watchTrace 5 + analyzeTrace 2 tests incl. the sandbox box path (mock box →
session parts → loop detected); full suite 1023 pass; typecheck/build/lint clean.

Live-box validation of the exact harness part-shape pending (decoder is defensive).
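A defensive decoder in this spirit might look like the sketch below. The part shapes are assumptions: the first branch follows the OpenAI-style `{ type: "function", function: { name, arguments } }` tool call, while the `tool_use` and `tool` branches guess at plausible harness shapes that, as noted above, have not been confirmed against a live box. The `ToolSpan` fields are simplified as well.

```typescript
interface ToolSpan { tool: string; args: unknown }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// OpenAI-style calls carry arguments as a JSON string; keep the raw value if parsing fails.
function parseArgs(raw: unknown): unknown {
  if (typeof raw !== "string") return raw;
  try {
    return JSON.parse(raw);
  } catch {
    return raw;
  }
}

// Returns a span for recognised tool parts and undefined for everything else.
function decodeToolPart(part: unknown): ToolSpan | undefined {
  if (!isRecord(part)) return undefined;
  const type = part["type"];

  // OpenAI-style function call.
  const fn = part["function"];
  if (type === "function" && isRecord(fn)) {
    const fnName = fn["name"];
    if (typeof fnName === "string") return { tool: fnName, args: parseArgs(fn["arguments"]) };
  }

  // Assumed harness shapes: { type: "tool_use", name, input } or { type: "tool", tool, input }.
  const name = type === "tool_use" ? part["name"] : type === "tool" ? part["tool"] : undefined;
  if (typeof name === "string") return { tool: name, args: part["input"] };

  return undefined; // text parts and unknown shapes are skipped, never thrown on
}
```

Returning `undefined` instead of throwing means a session with an unexpected part shape yields fewer spans rather than a crashed analysis, which is the failure mode to check for during live-box validation.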

@tangletools tangletools left a comment


✅ Auto-approved PR — 1e7d7ffc

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-17T10:08:32Z

@drewstone


Re-opening from a correctly-based branch (this one carried the unsquashed #318 commits → false conflict).

@drewstone drewstone closed this Jun 17, 2026
2 participants